Program Testing
The set of files to be preprocessed is available in this compressed tarfile or this files directory. For initial testing, copy a few of these files into your home directory for
processing. For final testing, use the full path
to these files as the input and your own path for the output to conserve
disk space. There are about 12 MB of data, and keeping it within your disk quota is your responsibility. You are free to store the files on your own machine.
Program Documentation
After your internal documentation (comments) is complete, write a
report that provides a short executive summary of your program.
In particular, discuss how you handled punctuation and numbers, and describe how
you calculated the frequency of each word. Identify any HTML constructs or words that are
incorrectly tokenized and discuss why your program does not handle
them properly. Also, discuss the efficiency of your frequency program in
terms of order of magnitude and timings (CPU time and elapsed time).
Include a small graph or table of time versus number of documents processed.
The entire document should be no more than three pages in length.
We will primarily be grading from the report, so make sure it clearly describes what you did and your program's output and efficiency.
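As an illustration of the timing table the report asks for, here is a minimal Python sketch that counts word frequencies and records CPU and elapsed time for growing numbers of documents. Everything in it is an assumption made for illustration: the names count_words and time_counting are hypothetical, the sample documents are made up, and the letters-only tokenizer is only a stand-in for your own. Note that it splits an HTML tag such as <p> into the token "p", which is the kind of incorrect tokenization the report should discuss.

```python
import re
import time
from collections import Counter

# Hypothetical stand-in tokenizer: lowercase runs of letters only.
# Your own handling of punctuation, numbers, and HTML will differ.
TOKEN_RE = re.compile(r"[a-z]+")


def count_words(documents):
    """Return a Counter mapping each token to its total occurrences."""
    counts = Counter()
    for text in documents:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts


def time_counting(documents, sizes):
    """Time count_words on the first n documents for each n in sizes.

    Returns a list of (n, cpu_seconds, elapsed_seconds) rows.
    """
    rows = []
    for n in sizes:
        cpu_start = time.process_time()
        wall_start = time.perf_counter()
        count_words(documents[:n])
        cpu = time.process_time() - cpu_start
        elapsed = time.perf_counter() - wall_start
        rows.append((n, cpu, elapsed))
    return rows


if __name__ == "__main__":
    # Made-up sample text; replace with the contents of the real files.
    docs = ["<p>The cat sat on the mat.</p>", "The dog, the cat!"] * 50
    print(f"{'docs':>6} {'cpu (s)':>10} {'elapsed (s)':>12}")
    for n, cpu, elapsed in time_counting(docs, [25, 50, 100]):
        print(f"{n:>6} {cpu:>10.6f} {elapsed:>12.6f}")
```

On an input this small the timings will be dominated by noise, so take the numbers for the report from runs over the full collection.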
You may work with your partner on implementing these programs, or you can implement them on your own.
We are providing a stoplist, but you won't need it for this phase of the project.
Hand In
Your code (including any shell scripts), the report, and the
first 50 and last 50 lines of the two frequency files.
Everybody turns in their own code, report, and output as described.
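One possible way to pull the first 50 and last 50 lines from a frequency file is sketched below in Python. The function name head_and_tail is hypothetical, and the sketch assumes the frequency files are plain text with one entry per line.

```python
from collections import deque
from itertools import islice


def head_and_tail(path, n=50):
    """Return (first n lines, last n lines) of a text file as two lists."""
    with open(path, encoding="utf-8", errors="replace") as f:
        head = list(islice(f, n))
    with open(path, encoding="utf-8", errors="replace") as f:
        # A deque with maxlen=n keeps only the last n lines it is fed.
        tail = list(deque(f, maxlen=n))
    return head, tail
```

If a file has fewer than 100 lines, the two lists overlap, so in that case it is simpler to hand in the whole file.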
Late Policy
10% deduction per 24 hours. Assignments turned in during or after
class will incur a 10% deduction.